---
title: esol
keywords: fastai
sidebar: home_sidebar
summary: "Using `molmapnets` for regression, tested on the `eSOL` dataset."
description: "Using `molmapnets` for regression, tested on the `eSOL` dataset."
nb_path: "03_train.ipynb"
---
{% raw %}
{% endraw %} {% raw %}
 
{% endraw %} {% raw %}
%config Completer.use_jedi = False
{% endraw %} {% raw %}
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
sns.set_theme(palette='Set2')
colors = sns.color_palette()
colors
{% endraw %} {% raw %}
{% endraw %} {% raw %}
from chembench import dataset
from molmap import MolMap

import torch
from torch import nn, optim
from torch.utils.data import DataLoader, random_split

from molmapnets.models import MolMapRegression
RDKit WARNING: [21:38:06] Enabling RDKit 2019.09.3 jupyter extensions
{% endraw %}

## Feature extraction

The `chembench` package collects several different datasets for benchmarking models. Here we'll use the `ESOL` dataset, which records the measured aqueous solubility of 1128 small organic compounds. The data can be loaded with

{% raw %}
data = dataset.load_ESOL()
total samples: 1128
{% endraw %}

We have the SMILES (Simplified Molecular Input Line Entry System) string for each compound and its corresponding solubility measurement:

{% raw %}
data.df.head()
|   | smiles | measured log solubility in mols per litre |
|---|--------|-------------------------------------------|
| 0 | OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)... | -0.77 |
| 1 | Cc1occc1C(=O)Nc2ccccc2 | -3.30 |
| 2 | CC(C)=CCCC(C)=CC(=O) | -2.06 |
| 3 | c1ccc2c(c1)ccc3c2ccc4c5ccccc5ccc43 | -7.87 |
| 4 | c1ccsc1 | -1.33 |
{% endraw %}

Using `MolMap` we can extract features from the SMILES strings. We can specify the feature type (`ftype`), the pairwise feature distance metric (`metric`), and the feature grid arrangement method (`fmap_type`):

{% raw %}
MolMap?
Init signature:
MolMap(
    ftype='descriptor',
    flist=None,
    fmap_type='grid',
    fmap_shape=None,
    split_channels=True,
    metric='cosine',
    var_thr=0.0001,
)
Docstring:      <no docstring>
Init docstring:
paramters
-----------------
ftype: {'fingerprint', 'descriptor'}, feature type
flist: feature list, if you want use some of the features instead of all features, each element in flist should be the id of a feature
fmap_shape: None or tuple, size of molmap, only works when fmap_type is 'scatter', if None, the size of feature map will be calculated automatically
fmap_type:{'scatter', 'grid'}, default: 'gird', if 'scatter', will return a scatter mol map without an assignment to a grid
split_channels: bool, if True, outputs will split into various channels using the types of feature
metric: {'cosine', 'correlation'}, default: 'cosine', measurement of feature distance
var_thr: float, defalt is 1e-4, meaning that feature will be included only if the conresponding variance larger than this value. Since some of the feature has pretty low variances, we can remove them by increasing this threshold
File:           ~/git/bidd-molmap/molmap/map.py
Type:           type
Subclasses:     
{% endraw %} {% raw %}
descriptor = MolMap(ftype='descriptor', metric='cosine',)
{% endraw %} {% raw %}
fingerprint = MolMap(ftype='fingerprint', metric='cosine')
{% endraw %}

After setting up the feature extraction method, we can use the `.fit` method of the feature object to compute the feature maps. During this step we need to specify the algorithm (`method`) used to embed the higher-dimensional features into a 2D representation:

{% raw %}
descriptor.fit(verbose=0, method='umap', min_dist=0.1, n_neighbors=15,)
2021-07-22 21:38:23,335 - INFO - [bidd-molmap] - Applying grid feature map(assignment), this may take several minutes(1~30 min)
2021-07-22 21:38:26,510 - INFO - [bidd-molmap] - Finished
{% endraw %} {% raw %}
fingerprint.fit(verbose=0, method='umap', min_dist=0.1, n_neighbors=10,)
/Users/olivier/.local/lib/python3.6/site-packages/umap/umap_.py:1461: UserWarning: Using precomputed metric; transform will be unavailable for new data
  "Using precomputed metric; transform will be unavailable for new data"
2021-07-22 21:38:54,356 - INFO - [bidd-molmap] - Applying grid feature map(assignment), this may take several minutes(1~30 min)
2021-07-22 22:11:46,397 - INFO - [bidd-molmap] - Finished
{% endraw %}

And we can then visualise the feature maps

{% raw %}
descriptor.plot_grid()
2021-07-22 22:11:46,423 - INFO - [bidd-molmap] - generate file: ./descriptor_1344_cosine_umap_molmap
2021-07-22 22:11:46,504 - INFO - [bidd-molmap] - save html file to ./descriptor_1344_cosine_umap_molmap
{% endraw %} {% raw %}
fingerprint.plot_grid()
2021-07-22 22:11:46,553 - INFO - [bidd-molmap] - generate file: ./fingerprint_15846_cosine_umap_molmap
2021-07-22 22:11:46,866 - INFO - [bidd-molmap] - save html file to ./fingerprint_15846_cosine_umap_molmap
{% endraw %} {% raw %}
descriptor.plot_scatter()
2021-07-22 22:11:47,061 - INFO - [bidd-molmap] - generate file: ./descriptor_1344_cosine_umap_scatter
2021-07-22 22:11:47,117 - INFO - [bidd-molmap] - save html file to ./descriptor_1344_cosine_umap_scatter
{% endraw %} {% raw %}
fingerprint.plot_scatter()
2021-07-22 22:11:47,149 - INFO - [bidd-molmap] - generate file: ./fingerprint_15846_cosine_umap_scatter
2021-07-22 22:11:47,523 - INFO - [bidd-molmap] - save html file to ./fingerprint_15846_cosine_umap_scatter
{% endraw %}

## Regression using the descriptor map

{% raw %}
X = descriptor.batch_transform(data.x)
100%|##########| 1128/1128 [08:45<00:00,  2.35it/s]
{% endraw %} {% raw %}
X.shape
(1128, 37, 37, 13)
{% endraw %}

In PyTorch, image-like training data takes the shape `(n_samples, n_channels, height, width)`, while the features extracted from MolMap take the shape `(n_samples, height, width, n_channels)`, so we first correct this by moving the channels dimension in front of the spatial dimensions:

{% raw %}
torch.movedim(torch.from_numpy(X), -1, 1).shape
torch.Size([1128, 13, 37, 37])
{% endraw %} {% raw %}
Y = data.y
{% endraw %} {% raw %}
Y.shape
(1128, 1)
{% endraw %}

Now from these feature maps we can create a dataset suitable for training models in PyTorch
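The package provides `SingleFeatureData` for this. As a rough, hypothetical sketch (not the library's actual code), such a wrapper only needs to store the targets and features and return channels-first float tensors:

{% raw %}
```python
import numpy as np
import torch
from torch.utils.data import Dataset

class SingleFeatureDataSketch(Dataset):
    """Hypothetical sketch of a (y, X) wrapper, with X of shape (N, H, W, C)."""
    def __init__(self, y, X):
        # PyTorch conv layers expect (N, C, H, W), so move channels forward
        self.X = torch.movedim(torch.from_numpy(X).float(), -1, 1)
        self.y = torch.from_numpy(np.asarray(y)).float()

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]
```
{% endraw %}

Indexing such a wrapper built on the descriptor features would yield a `(13, 37, 37)` feature tensor and a one-element target per sample.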

{% raw %}

class SingleFeatureData[source]

SingleFeatureData(*args, **kwds) :: Dataset

Process single feature map for model training. y: target X: feature map

{% endraw %} {% raw %}
{% endraw %} {% raw %}
esol = SingleFeatureData(data.y, X)
{% endraw %} {% raw %}
train, val, test = random_split(esol, [904,112,112], generator=torch.Generator().manual_seed(7))
{% endraw %} {% raw %}
len(train), len(val), len(test)
(904, 112, 112)
{% endraw %} {% raw %}
train_loader = DataLoader(train, batch_size=8, shuffle=True)
val_loader = DataLoader(val, batch_size=8, shuffle=True)
test_loader = DataLoader(test, batch_size=8, shuffle=True)
{% endraw %}

And we can get one batch of data by calling `next` on an iterator over the data loader:

{% raw %}
x, t = next(iter(train_loader))
{% endraw %} {% raw %}
t
tensor([[-2.3560],
        [-2.4840],
        [-1.0600],
        [ 0.3900],
        [ 0.9600],
        [-2.1700],
        [-2.7300],
        [-0.6200]])
{% endraw %} {% raw %}
x.shape
torch.Size([8, 13, 37, 37])
{% endraw %}

Finally, with the data prepared, we can train the models. These runs are smoke tests to show that the models work as expected; the models could certainly be fine-tuned to achieve better results.

{% raw %}
model = MolMapRegression()

epochs = 5
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()
{% endraw %}

And the training loop

{% raw %}
for epoch in range(epochs):

    running_loss = 0.0
    for i, (xb, yb) in enumerate(train_loader):

        xb, yb = xb.to(device), yb.to(device)

        # zero gradients
        optimizer.zero_grad()

        # forward propagation
        pred = model(xb)

        # loss calculation
        loss = criterion(pred, yb)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if (i+1) % 50 == 0:    
            print('[Epoch: %d, Iter: %5d] Training loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / (i+1)))

print('Training finished')
/Users/olivier/opt/anaconda3/envs/molmap/lib/python3.6/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  ../c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
[Epoch: 1, Iter:    50] Training loss: 6.567
[Epoch: 1, Iter:   100] Training loss: 5.444
[Epoch: 2, Iter:    50] Training loss: 3.304
[Epoch: 2, Iter:   100] Training loss: 2.579
[Epoch: 3, Iter:    50] Training loss: 2.061
[Epoch: 3, Iter:   100] Training loss: 1.807
[Epoch: 4, Iter:    50] Training loss: 1.179
[Epoch: 4, Iter:   100] Training loss: 1.294
[Epoch: 5, Iter:    50] Training loss: 1.191
[Epoch: 5, Iter:   100] Training loss: 1.397
Training finished
{% endraw %}

Loss on validation data set

{% raw %}
running_loss = 0.0
with torch.no_grad():
    for i, (xb, yb) in enumerate(val_loader):

        xb, yb = xb.to(device), yb.to(device)

        # forward propagation
        pred = model(xb)

        # loss calculation
        loss = criterion(pred, yb)
        running_loss += loss.item()
        if (i+1) % 3 == 0:    
            print('[Iter: %5d] Validation loss: %.3f' %
                    (i + 1, running_loss / (i+1)))
[Iter:     3] Validation loss: 0.732
[Iter:     6] Validation loss: 0.950
[Iter:     9] Validation loss: 0.959
[Iter:    12] Validation loss: 0.983
{% endraw %}
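The `test_loader` created earlier is still unused; the same pattern gives a final held-out loss. A small hypothetical helper (mirroring the validation loop above) could average the loss over any loader:

{% raw %}
```python
import torch

def mean_loss(model, loader, criterion, device="cpu"):
    """Average a criterion over all batches of a loader, without gradients."""
    model.eval()
    total, batches = 0.0, 0
    with torch.no_grad():
        for xb, yb in loader:
            xb, yb = xb.to(device), yb.to(device)
            total += criterion(model(xb), yb).item()
            batches += 1
    return total / max(batches, 1)
```
{% endraw %}

Calling `mean_loss(model, test_loader, criterion, device)` would then report the mean test-set MSE for the descriptor model.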

## Regression using the fingerprint map

{% raw %}
X_fingerprint = fingerprint.batch_transform(data.x)
100%|##########| 1128/1128 [03:37<00:00,  5.55it/s]
{% endraw %} {% raw %}
X_fingerprint.shape
(1128, 126, 126, 12)
{% endraw %}

Now from these feature maps we can create a dataset suitable for training models in PyTorch

{% raw %}
esol_fingerprint = SingleFeatureData(data.y, X_fingerprint)
{% endraw %} {% raw %}
train_fingerprint, val_fingerprint, test_fingerprint = random_split(esol_fingerprint, [904,112,112], generator=torch.Generator().manual_seed(7))
{% endraw %} {% raw %}
len(train_fingerprint), len(val_fingerprint), len(test_fingerprint)
(904, 112, 112)
{% endraw %} {% raw %}
train_loader_fingerprint = DataLoader(train_fingerprint, batch_size=8, shuffle=True)
val_loader_fingerprint = DataLoader(val_fingerprint, batch_size=8, shuffle=True)
test_loader_fingerprint = DataLoader(test_fingerprint, batch_size=8, shuffle=True)
{% endraw %}

And we can get one batch of data by calling `next` on an iterator over the data loader:

{% raw %}
x, t = next(iter(train_loader_fingerprint))
{% endraw %} {% raw %}
t.shape
torch.Size([8, 1])
{% endraw %} {% raw %}
x.shape
torch.Size([8, 12, 126, 126])
{% endraw %}

And regression. Different feature maps have different numbers of channels, so we set the number of input channels explicitly:

{% raw %}
model_fingerprint = MolMapRegression(conv_in1=12)

epochs = 5
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model_fingerprint.to(device)
optimizer = optim.Adam(model_fingerprint.parameters(), lr=0.001)
criterion = nn.MSELoss()
{% endraw %}

And the training loop

{% raw %}
for epoch in range(epochs):

    running_loss = 0.0
    for i, (xb, yb) in enumerate(train_loader_fingerprint):

        xb, yb = xb.to(device), yb.to(device)

        # zero gradients
        optimizer.zero_grad()

        # forward propagation
        pred = model_fingerprint(xb)

        # loss calculation
        loss = criterion(pred, yb)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if (i+1) % 50 == 0:    
            print('[Epoch: %d, Iter: %5d] Training loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / (i+1)))

print('Training finished')
[Epoch: 1, Iter:    50] Training loss: 4.537
[Epoch: 1, Iter:   100] Training loss: 3.935
[Epoch: 2, Iter:    50] Training loss: 2.271
[Epoch: 2, Iter:   100] Training loss: 1.954
[Epoch: 3, Iter:    50] Training loss: 1.258
[Epoch: 3, Iter:   100] Training loss: 1.157
[Epoch: 4, Iter:    50] Training loss: 1.027
[Epoch: 4, Iter:   100] Training loss: 0.880
[Epoch: 5, Iter:    50] Training loss: 0.660
[Epoch: 5, Iter:   100] Training loss: 0.649
Training finished
{% endraw %}

Loss on validation data set

{% raw %}
running_loss = 0.0
with torch.no_grad():
    for i, (xb, yb) in enumerate(val_loader_fingerprint):

        xb, yb = xb.to(device), yb.to(device)

        # forward propagation
        pred = model_fingerprint(xb)

        # loss calculation
        loss = criterion(pred, yb)
        running_loss += loss.item()
        if (i+1) % 3 == 0:    
            print('[Iter: %5d] Validation loss: %.3f' %
                    (i + 1, running_loss / (i+1)))
[Iter:     3] Validation loss: 0.442
[Iter:     6] Validation loss: 0.599
[Iter:     9] Validation loss: 0.708
[Iter:    12] Validation loss: 0.851
{% endraw %}

## Regression using both feature maps

If we want to use both the feature maps, we have to process the training data differently.

{% raw %}

class DoubleFeatureData[source]

DoubleFeatureData(*args, **kwds) :: Dataset

Process single feature map for model training. y: target X: tuple of two feature maps

{% endraw %} {% raw %}
{% endraw %}
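As with the single-map case, here is a hypothetical minimal sketch (not the library's actual code) of such a two-map wrapper:

{% raw %}
```python
import numpy as np
import torch
from torch.utils.data import Dataset

class DoubleFeatureDataSketch(Dataset):
    """Hypothetical sketch of a (y, (X1, X2)) wrapper, each X of shape (N, H, W, C)."""
    def __init__(self, y, X):
        X1, X2 = X
        # move channels forward for both maps: (N, H, W, C) -> (N, C, H, W)
        self.X1 = torch.movedim(torch.from_numpy(X1).float(), -1, 1)
        self.X2 = torch.movedim(torch.from_numpy(X2).float(), -1, 1)
        self.y = torch.from_numpy(np.asarray(y)).float()

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        # each sample is a tuple of two feature maps plus one target
        return (self.X1[idx], self.X2[idx]), self.y[idx]
```
{% endraw %}

The only difference from the single-map wrapper is that each sample packs the two channels-first feature tensors into a tuple.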

Now we can feed both feature maps to the model as a tuple

{% raw %}
double_feature = DoubleFeatureData(data.y, (X, X_fingerprint))
{% endraw %} {% raw %}
train_double, val_double, test_double = random_split(double_feature, [904,112,112], generator=torch.Generator().manual_seed(7))
{% endraw %} {% raw %}
len(train_double), len(val_double), len(test_double)
(904, 112, 112)
{% endraw %} {% raw %}
train_loader_double = DataLoader(train_double, batch_size=8, shuffle=True)
val_loader_double = DataLoader(val_double, batch_size=8, shuffle=True)
test_loader_double = DataLoader(test_double, batch_size=8, shuffle=True)
{% endraw %}

And we can get one batch of data by calling `next` on an iterator over the data loader:

{% raw %}
x, t = next(iter(train_loader_double))
{% endraw %} {% raw %}
t.shape
torch.Size([8, 1])
{% endraw %} {% raw %}
x1, x2 = x
x1.shape, x2.shape
(torch.Size([8, 13, 37, 37]), torch.Size([8, 12, 126, 126]))
{% endraw %}

And regression. Different feature maps have different numbers of channels, so we set the number of input channels for each branch explicitly:

{% raw %}
model_double = MolMapRegression(conv_in1=13, conv_in2=12)

epochs = 5
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model_double.to(device)
optimizer = optim.Adam(model_double.parameters(), lr=0.001)
criterion = nn.MSELoss()
{% endraw %}

And the training loop

{% raw %}
for epoch in range(epochs):

    running_loss = 0.0
    for i, ((x1, x2), yb) in enumerate(train_loader_double):

        x1, x2, yb = x1.to(device), x2.to(device), yb.to(device)

        # zero gradients
        optimizer.zero_grad()

        # forward propagation
        pred = model_double((x1, x2))

        # loss calculation
        loss = criterion(pred, yb)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if (i+1) % 50 == 0:    
            print('[Epoch: %d, Iter: %5d] Training loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / (i+1)))

print('Training finished')
[Epoch: 1, Iter:    50] Training loss: 6.050
[Epoch: 1, Iter:   100] Training loss: 4.608
[Epoch: 2, Iter:    50] Training loss: 2.854
[Epoch: 2, Iter:   100] Training loss: 2.403
[Epoch: 3, Iter:    50] Training loss: 1.834
[Epoch: 3, Iter:   100] Training loss: 1.536
[Epoch: 4, Iter:    50] Training loss: 1.027
[Epoch: 4, Iter:   100] Training loss: 0.996
[Epoch: 5, Iter:    50] Training loss: 0.595
[Epoch: 5, Iter:   100] Training loss: 0.677
Training finished
{% endraw %}

Loss on validation data set

{% raw %}
running_loss = 0.0
with torch.no_grad():
    for i, ((x1, x2), yb) in enumerate(val_loader_double):

        x1, x2, yb = x1.to(device), x2.to(device), yb.to(device)

        # forward propagation
        pred = model_double((x1, x2))

        # loss calculation
        loss = criterion(pred, yb)
        running_loss += loss.item()
        if (i+1) % 3 == 0:    
            print('[Iter: %5d] Validation loss: %.3f' %
                    (i + 1, running_loss / (i+1)))

print('Validation finished')
[Iter:     3] Validation loss: 1.191
[Iter:     6] Validation loss: 1.088
[Iter:     9] Validation loss: 1.002
[Iter:    12] Validation loss: 0.874
Validation finished
{% endraw %}